Crawling Microblog by Common-Designed Software
Abstract
Large amounts of microblog data need to be crawled for research, business analysis, and other purposes. However, microblog Web pages make heavy use of dynamic Web techniques, which makes it hard for traditional Web page crawlers to obtain data by parsing page contents. Fortunately, microblogs provide APIs: well-structured data is returned to users simply by accessing those APIs in the form of URLs. Based on that mechanism, researchers have obtained microblog data for their studies. Nevertheless, no common software for crawling microblogs has been published so far, so everyone has to design a microblog crawler from scratch. This paper proposes a common software architecture for microblog crawlers based on microblog APIs, named MBCrawler, and introduces its structure, architecture, and key classes. MBCrawler is modular and scalable. A real microblog crawler implemented for Sina Weibo shows that MBCrawler can accommodate the specific features of different microblogs.
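The mechanism described in the abstract (request an API URL, receive structured data, keep service-specific details in a replaceable module) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class `MicroblogApi`, the `followers` method, the `uid` parameter, and the `api.example.com` base URL are hypothetical and are not taken from MBCrawler or from any real microblog API. The HTTP request is replaced by a stub so the sketch runs offline.

```python
import json
from collections import deque
from typing import Callable, List
from urllib.parse import parse_qs, urlencode, urlparse


class MicroblogApi:
    """Hypothetical service adapter; a subclass could cover one microblog."""

    base_url = "https://api.example.com"  # placeholder, not a real endpoint

    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` takes a URL and returns the response body as text.
        self.fetch = fetch

    def build_url(self, method: str, **params: str) -> str:
        # An API call is expressed as a URL with query parameters.
        return f"{self.base_url}/{method}.json?{urlencode(sorted(params.items()))}"

    def call(self, method: str, **params: str) -> dict:
        # The response is assumed to be JSON, so no HTML parsing is needed.
        return json.loads(self.fetch(self.build_url(method, **params)))

    def followers(self, user_id: str) -> List[str]:
        # Hypothetical method name and response shape: {"ids": [...]}.
        return self.call("followers", uid=user_id)["ids"]


def crawl_users(api: MicroblogApi, seed: str, limit: int) -> List[str]:
    """Breadth-first crawl of user IDs, visiting at most `limit` users."""
    seen = {seed}
    queue = deque([seed])
    visited: List[str] = []
    while queue and len(visited) < limit:
        user = queue.popleft()
        visited.append(user)
        for follower in api.followers(user):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return visited


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET, backed by a tiny in-memory follower graph."""
    graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
    uid = parse_qs(urlparse(url).query)["uid"][0]
    return json.dumps({"ids": graph.get(uid, [])})


if __name__ == "__main__":
    api = MicroblogApi(fake_fetch)
    print(crawl_users(api, seed="a", limit=10))
```

Because the crawl loop only talks to the adapter, supporting another service in this sketch would mean subclassing `MicroblogApi` and passing a real HTTP function as `fetch`, while `crawl_users` stays unchanged.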
Similar resources
Piecing together the puzzle: Improving event content coverage for real-time sub-event detection using adaptive microblog crawling
In an age when people are predisposed to report real-world events through their social media accounts, many researchers value the benefits of mining user generated content from social media. Compared with the traditional news media, social media services, such as Twitter, can provide more complete and timely information about the real-world events. However events are often like a puzzle and in ...
Learning to Rank Microblog Posts for Real-Time Ad-Hoc Search
Microblogging websites have emerged to the center of information production and diffusion, on which people can get useful information from other users’ microblog posts. In the era of Big Data, we are overwhelmed by the large amount of microblog posts. To make good use of these informative data, an effective search tool is required specialized for microblog posts. However, it is not trivial to d...
PKU_ICST at TREC 2011 Microblog Track
This paper describes the PKU_ICST participation in the TREC 2011 Microblog track. In the first year of Microblog track, we designed a group of experiments to verify whether external resources and future resources would improve the performance of our system. Moreover, given that microblog track is a real-time ad hoc task, we explored an ap...
Thread Cleaning and Merging for Microblog Topic Detection
As a classic natural language processing technology, topic detection recently attracts more research interests due largely to the rapid development of microblog. The most challenging issue in microblog topic detection is sparse data problem. In this paper, the temporal-author-topic (TAT) model is designed to accomplish microblog topic detection in two phases. In the first phase, the TAT model i...
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every...